This vignette introduces scBubbletree, a transparent methodology for single cell RNA-seq data exploration based on well established methods clustering and visualization. We will demonstrate the functionality of scBubbletree by analyzing three scRNA-seq datasets (Case studies A, B and C), while also showing how to integrate scBubbletree with existing pipelines for scRNA-seq analysis e.g. based on Seurat.

To run this vignette we need several R-packages. Load them now:

source(file = "../Rutil/Graphics.R")

source(file = "../scBubbletree/R/util.R")
source(file = "../scBubbletree/R/main.R")
source(file = "../scBubbletree/R/annotation.R")
#library(scBubbletree, lib.loc = lib.loc)
library(cluster, lib.loc = lib.loc)
library(Seurat, lib.loc = lib.loc)
library(ggplot2, lib.loc = lib.loc)
library(reshape2, lib.loc = lib.loc)
library(parallel, lib.loc = lib.loc)
library(ape, lib.loc = lib.loc)
library(ggtree, lib.loc = lib.loc)
library(org.Hs.eg.db, lib.loc = lib.loc)
library(bluster, lib.loc = lib.loc)
library(SummarizedExperiment, lib.loc = lib.loc)
library(ggrepel, lib.loc = lib.loc)

Case study A: 3,918 cells from 5 cancer cell lines 1

Data

In this case study we will analyze scRNA-seq mixture of 3,918 cells derived from 5 adenocarcinoma cell lines: H2228, H1975, HCC827, H838 and A549

Cell types (ground truth) were inferred for each cell on the basis of known genetic variation with demuxlet. We will use this simple dataset to demonstrate the advantages of quantitative scRNA-seq data exploration with scBubbletree in combination with Seurat preprocessing.

Load the data and perform some basic scRNA-seq data processing steps with Seurat:

# Lets load the benchmark data
load(file = "raw_data/Tian_2019/sc_mixology-master/data/sincell_with_class_5cl.RData")


# We are only interested in the 10x data object 'sce_sc_10x_5cl_qc'
d <- sce_sc_10x_5cl_qc


# Remove the remaining objects
rm(sc_Celseq2_5cl_p1, sc_Celseq2_5cl_p2, sc_Celseq2_5cl_p3, sce_sc_10x_5cl_qc)


# Get the meta data for each cell
meta <- colData(d)[, c("cell_line_demuxlet", "non_mt_percent", "total_features")]


# Create Seurat object from the raw counts and append the meta data to it
d <- Seurat::CreateSeuratObject(counts = d@assays$data$counts,
                                project = '')

# check if all cells are matched between d and meta
# table(rownames(d@meta.data) == meta@rownames) 
d@meta.data <- cbind(d@meta.data, meta@listData)


# cell type predictions are provided as part of the meta data
table(d@meta.data$cell_line)

# select 5,000 most variable genes
d <- Seurat::FindVariableFeatures(object = d, 
                                  selection.method = "vst", 
                                  nfeatures = 5000)

# Preprocessing with Seurat: SCT transformation + PCA + UMAP 
d <- SCTransform(object = d)
d <- RunPCA(object = d, npcs = 50, features = VariableFeatures(object = d))
d <- RunUMAP(d, dims = 1:50)

Exploring 2D UMAPs is by now a standard first step in scRNA-seq data analysis. As this is a toy dataset composed of distinct cell lines, we should be able to interpret the resulting 2D UMAP without much trouble.

The 2D UMAP (left panel) appears to contain between 5 and 8 clusters of cells. After color-coding the cells according to their predicted cell types (right panel) we see the 5 clusters (cell lines), with some substructure also visible within the clusters.

While 2D UMAPs are intuitive, by themselves these maps fail to provide the user with quantitative information about the scRNA-seq data. For instance, due to the high overplotting it difficult to tell how many cells are found in the different clusters. Such basic information on the composition of the sample, we think, is essential to check whether the experiment has worked as planned.

Furthermore, techniques such as UMAP and t-SNE primarily focus on preserving local cell distances, which makes the interpretation of global distances challenging. This means that we cannot interpret distances between cells in the UMAP as distances in Euclidean space, which also implies that we cannot learn about the cell type structure in our data directly from the 2D UMAP or t-SNE plots.

All of these challenges will be exaggerated in more realistic scenarios, such as scRNA-seq data derived from complex (heterogeneous) tissues and also composed of 10- to 100-times higher number of cells. In such analyses we will see crowded 2D UMAPs composed of many clusters of cells that are not clearly separated from each other. We will encounter many of these challenges in Case Study B.

Using 2D UMAPs in publications presents a challenge not only for the readers but also for the reviewers, who have to evaluate these complex figures without having access to the raw data.

scBubbletree approach

For a more quantitative and transparent exploration and visualization of scRNA-seq we developed scBubbletree.

As first input scBubbletree uses matrix \(A^{n\times f}\) which represents a low-dimensional projection of the original scRNA-seq data, with \(n\) rows as cells and \(f\) columns as low-dimension features. We will use the PCA data generated by Seurat as \(A\). In particular, we will use the first 15 principal components (PCs).

# This is the main input of scBubbletree -> matrix A
A <- d@reductions$pca@cell.embeddings[, 1:15]


# A has n=cells as rows, f=features as columns (e.g. from PCA)
dim(A)
FALSE [1] 3918   15

The scBubbletree algorithm performs these main operations:

  1. determine k
  2. clustering with k-means with k
  3. hierarchical organization of clusters (bubbles)
  4. visualization & model assessment

As scBubbletree uses the well known k-means clustering of \(A\) to identify clusters of transcriptionally similar cells, our first goal is to determine k, i.e. the number of clusters in the data. This can be achieved with the function get_k.

get_k performs k-means clustering B times (bootstrapping iterations) for a vector of k values specified by the parameter ks. For this we draw a sample of rows from \(A\) with replacement. The proportion of rows sampled is controlled by cv_clust_p.

get_k then computes three metrics for each k and B which allow us to find the optimal k:

  • silhouette index
  • gap statistic
  • within sum of squares (WSS) for elbow method

The remaining parameters of get_k are explained below:

  • n_start and iter_max: used to tune k-means (see ?kmeans)
  • cores: computer cores to be used
  • cv_index_p proportion of cells to be used for the computation of the gap statistic. For small samples (e.g. < 20,000 cells), you can use the default parameter. For larger samples even low cv_index_p=0.1 produces robust results while speeding up the computation. Smaller B (e.g. B=100) might be helpful to execute this function in several minutes less than one hour for large datasets.

Lets run get_k now.

if(redo_case_a) {
  # Determine appropriate number of clusters (k)
  b <- get_k(B = 100, 
             cv_clust_p = 1, 
             cv_index_p = 1, 
             ks = 1:20,
             x = A,
             n_start = 10, 
             iter_max = 50, 
             cores = 20)
  
  if(dir.exists("case_study_A")==F) {
    dir.create("case_study_A")
  }
  save(b, file = "case_study_A/b.RData")
} else {
  b <- get(load("~/scBubbletree/case_study_A/b.RData"))
}

Mean index values and 95% confidence intervals obtained from get_k are shown below. The raw data for each bootstrap iteration are also stored as part of the object b.

k is is easily discernible from each of the three metrics (panels).

The silhouette index peaks around k=5 (left panel) and for larger ks the silhouette index starts to drop and is numerically less stable. Hence, we can conclude that k values between 4 and 6 seem like a reasonable choice for this dataset, which is consistent with the 2D UMAP structure shown before. This conclusion is confirmed by the gap statistic (middle panel) and the elbow method based on WSS (right panel), i.e. we see a knee in the curves at k=5.

Lets start with the simplest model (segmentation) k=5 and use scBubbletree to perform clustering and use the clustering data to generate a dendrogram, showing the clusters as bubbles (tree leaves) in the dendrogram.

# Perform clustering to get data for scBubbletree
k5 <- get_bubbletree(x = A,
                     k = 5,
                     seed = 1234,
                     cores = 1,
                     B = 100,
                     N_eff = 100,
                     verbose = F,
                     n_start = 100,
                     iter_max = 100,
                     round_digits = 1,
                     show_branch_support = T)
FALSE Clustering ... 
FALSE Bubbletree construction ...

Lets now plot the resulting dendrogram:

k5$tree

Bubbles

The generated dendrogram has k=5 bubbles (clusters) shown as tree leaves. The radius of each bubble is scaled as function of the the number of cells that belong to the corresponding cluster. Analogously, the bubbles are color-coded according to their sizes, i.e. dark bubbles are larger and have many cells, and bright bubbles are small and contain few cells. The absolute and relative cell frequencies in the different bubbles are shown as labels.

Bubble 3 is the largest (and darkest) one in the dendrogram and contains 1,253 cells (\(\approx\) 32% of all cells in the dataset). Bubble 5 is the smallest one (and brightest) and contains only 436 cells (\(\approx\) 11% of all cells in the dataset).

We can access the bubble data shown in the bubbletree with the following code:

knitr::kable(k5$tree_meta, digits = 1)
label c n p pct tree_order
5 5 436 3918 0.1 11.1 5
4 4 593 3918 0.2 15.1 4
3 3 1253 3918 0.3 32.0 3
1 1 761 3918 0.2 19.4 2
2 2 875 3918 0.2 22.3 1

Hierarchical tree structure

The average distances between a pair of bubbles are represented by the sums of branch length in the tree. This information is included as part of the object k5:

# c_i, c_j = pair of clusters/bubbles
# M = mean inter-cluster dissimilarity
# L95/H95 = lower/upper bounds 95% confidence 
# interval of the mean dissimilarity
knitr::kable(k5$pair_dist$hc_pair_dist, digits = 1)
c_i c_j M SE L95 H95
1 1 0.0 0.0 0.0 0.0
1 2 86.7 0.1 86.5 86.8
1 3 84.5 0.1 84.2 84.7
1 4 84.1 0.1 83.9 84.3
1 5 84.1 0.1 83.9 84.3
2 1 86.7 0.1 86.5 86.8
2 2 0.0 0.0 0.0 0.0
2 3 85.6 0.2 85.2 86.1
2 4 86.3 0.1 86.0 86.5
2 5 86.3 0.1 86.0 86.5
3 1 84.5 0.1 84.2 84.7
3 2 85.6 0.2 85.2 86.1
3 3 0.0 0.0 0.0 0.0
3 4 82.3 0.2 82.0 82.6
3 5 82.3 0.2 82.0 82.6
4 1 84.1 0.1 83.9 84.3
4 2 86.3 0.1 86.0 86.5
4 3 82.3 0.2 82.0 82.6
4 4 0.0 0.0 0.0 0.0
4 5 72.6 0.1 72.4 72.7
5 1 84.1 0.1 83.9 84.3
5 2 86.3 0.1 86.0 86.5
5 3 82.3 0.2 82.0 82.6
5 4 72.6 0.1 72.4 72.7
5 5 0.0 0.0 0.0 0.0

In general we see similar distances between the different pairs of bubbles. The only exception is the somewhat smaller distance between the bubbles 4 and 5. The tree is also devoit of structure, which makes sense as the input data is composed of cells derived from from 5 distinct cell lines. In case study B, we will analyze scRNA-seq data from PBMCs, where cell type structure is evident, i.e. we will see branches in the corresponding phylogenetic tree that are formed by transcriptionally related clusters of cells (e.g. CD4+ and CD8+ T-cells).

Tree topology robustness

By default the function get_bubbletree annotates the branches of the bubbletree with their bootstrap support. This tells us the frequency with which a given branch in the tree is found the \(B\) bootstrap trees. Above we called get_bubbletree with B=100, and two of the tree branches have complete support (100 out of 100) while the branch that joins bubbles (4, 5) and bubble 3 has lower support (38 out of 100 => 76%). This tells us that the corresponding branch is not robust.

One way of visualizing the branch support is by using density trees. For this we can use function ggdensitree provided by ggtree and provide as input the \(B\) phylogenetic trees generated during the boostraping procedure and stored in the object k5:

ggtree::ggdensitree(k5$ph$boot_ph, alpha=0.1, colour='steelblue')+
  geom_tiplab(size = 7)+
  hexpand(ratio = 0.5)

Can we verify that this clustering is sensible?

Are A549 and HCC827 transcriptionally similar?

We downloaded gene expression data for 1,019 human cancer cell lines from: https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2770/Downloads and extracted data (gene TPM values) on 69 lung adenocarcinoma cell lines.

From this data we computed Euclidean distances between pairs of cell lines and generated a hierarchical clustering dendrogram with average linkage.

It turns out that A549 and HCC827 share a common branch in this dendrogram => similar gene expression profiles are accurately described by the bubbletree topology. Other cell lines are dispersed in different branches of the dendrogram and hence we do not see any special structure connecting the corresponding bubbles of the bubbletree.

tpm <- read.csv(file = "manuscript_data/E-MTAB-2770-query-results.tpms.tsv", 
                sep = "\t", comment.char = "#")


tpm <- tpm[, c(1, 2, which(regexpr(pattern = "lung\\.adenocarcinoma", 
                                   text = colnames(tpm)) != -1))]
rownames(tpm) <- tpm$Gene.ID
tpm$Gene.ID <- NULL
tpm$Gene.Name <- NULL
tpm <- t(tpm)
rownames(tpm) <- gsub(pattern = "\\.\\.papillary\\.lung\\.adenocarcinoma|\\..lung\\.adenocarcinoma|NCI\\.", 
                      replacement = '', x = rownames(tpm))
rownames(tpm) <- gsub(pattern = "\\.",
     replacement = '',
     x = rownames(tpm))


# require(compositions)
# atpm <- compositions::acomp(tpm)
# euc_atpm <- dist(x = atpm, method = "euclidean")
# ah <- hclust(d = dist(atpm, method = "euclidean"), method = "average")

h <- hclust(d = dist(tpm, method = "euclidean"), method = "average")

five_cell_lines <- c("H838", "H2228", "A549", "H1975", "HCC827")

cols <- rep(x = "black", times = length(h$labels))
cols[which(h$labels %in% five_cell_lines)] <- "red"

pdf(width = 8.5, height = 4.5,
    file = "manuscript_data/Fig_S1.pdf")
plot(as.dendrogram(h),
     label.offset = 1, nodePar = list(lab.cex = 0.7, cex = 0.7,
                                      pch = c(NA, 19)))
dev.off()
FALSE png 
FALSE   2
plot(as.dendrogram(h),
     label.offset = 1, nodePar = list(lab.cex = 0.7, cex = 0.7,
                                      pch = c(NA, 19)))

Visualizing features

Visualizing features of the cells in the different bubbles in the bubbletree can help us understand the biology behind different clusters (bubbles) as well as the structure of the tree. For instance, we may want to show the average expression of a certain marker gene in each bubble, or maybe the percent of mitochondrial gene content, both of which are numeric features.

Cells may also have categorical features. For instance: a) a cell may have predicted cell type label; b) a cell may be categorized as a singlet, doublet or multiplet; c) for a given cell we may have or have not found a B-cell receptor sequence in a corresponding immune profiling library.

In the next two paragraph we will explain how to ‘attach’ numeric and categorical features to the bubbletree using scBubbletree.

Categorical features

Categorical features can be ‘attached’ to the bubbletree using the function get_cat_feature_tiles. One of the categorical features in this case study are the predicted cell types. With get_cat_feature_tiles we can show make two types of visualization:

First, we can show the relative frequency distribution of a feature across the five bubbles (with parameter feature_composition=T). Second, we can show the cell type composition of each bubble (with parameter feature_composition=F):

w1 <- get_cat_feature_tiles(d = k5,
                            a = d@meta.data$cell_line_demuxlet,
                            feature_composition = T,
                            round_digits = 1,
                            rotate_x_axis = T)

scBubbletree uses the R-package patchwork to combine plots. Lets merge the bubbletree and the numeric features plot:

(k5$tree|w1$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))

This plot tells us that each cells type is nearly exclusively (>99%) found in its own bubble, i.e. columns integrate to 100%. For instance, 99.76% of the A549 cells are found in bubble 3, 99.09% of the H1975 cells are found in bubble 5. Few cells are mixed between the bubbles. This mixing is also present in the 2D UMAP, however, doe to the heavy overplotting they are not visible. Our approach is more transparent in this sense, and show in a quantitative way the clustering output.

We might also be interested in checking the ‘purity’ of individual bubbles. For this we will draw a similar plot but make sure that the rows integrate to 100%:

w2 <- get_cat_feature_tiles(d = k5,
                            a = d@meta.data$cell_line_demuxlet,
                            feature_composition = TRUE)

(k5$tree|w2$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))

The above feature plot informs us that 100% of the cells in bubble 5 belong to cell type H1975. About 0.17% of the 593 cells in bubble 4 belong to cell A549. This is equal to 0.17/100*593 = 1 cell, while the majority of cells 98.99 belong to cell HCC827. Hence, we see quite high bubble purity.

Gini impurity index

Using the Gini impurity index we can quantify the goodness of the clustering with k=5 and the impurity of the individual bubbles. This functionality is implemented by the function get_gini.

FALSE $cluster_gini
FALSE           4           2           5           1           3 
FALSE 0.020099588 0.004563592 0.000000000 0.010474495 0.000000000 
FALSE 
FALSE $total_gini
FALSE [1] 0.006095786

All cluster-specific Gini impurity indices are close to 0 and thus also the total (global) impurity has a value close to 0 as well. This indicates nearly perfect clustering of cell types across bubbles. With this function, scBubbletree provides a quantitative way of summarizing the tile plots shown earlier.

scBubbletree also implement the function get_gini_boot, which allows us to integrate clustering data obtained from get_k with cell type predictions (labels), and to show quantitatively the Gini impurity index as a function of k. We can conclude that at k=5 all labels are nearly perfectly segmented across the bubbles, and each bubble contains cells exclusively from one cell type (1-to-1 mapping between bubbles and cell types as seen shown by the tile plots).

gini_boot <- get_gini_boot(labels = d@meta.data$cell_line_demuxlet,
                           kmeans_boot_obj = b)
g1 <- ggplot(data = gini_boot$total_gini_summary)+
  geom_point(aes(x = k, y = total_gini_mean), size = 1)+
  geom_errorbar(aes(x = k, y = total_gini_mean, ymin = L95, 
                    ymax = H95), width = 0.1)+
  ggtitle(subtitle = "Gini impurity", label = '')+
  ylab(label = "Mean impurity")

g1

Numeric annotations

scBubbletree also implements add-ons for visualization of numeric cell features, such as gene expression, number of reads or features, mitochondrial content etc. Lets visualize the expression of five marker genes, i.e. one marker gene for each of the five cancer cell lines.

# First we need to select gene expressions for each cell and 
# also for five marker genes
as <- as.matrix(t(d@assays$SCT@data[
  rownames(d@assays$SCT@data) %in% 
    c("ALDH1A1", 
      "PIP4K2C", 
      "S100A9", 
      "PEG10",
      "SLPI"), ]))

# 'as' is a matrix with n=rows for cells and a=columns for 
# annotations (genes). The column names will be shown in
# the plot.

# We will order the columns in 'as' in the same way we want
# them to be plotted. These genes are markers for: A549, 
# HCC827, H1975, H2228 and H838
as <- as[, c("ALDH1A1", 
             "PIP4K2C", 
             "S100A9", 
             "PEG10",
             "SLPI")]

We can visualize numeric features in two ways.

First, we can show the feature averages (in this example: average gene expressions) in each bubble with get_num_feature_tiles. Lets invoke this function now:

# This function build the nummeric annotations plot
w3 <- get_num_feature_tiles(d = k5,
                            as = as,
                            plot_title = "",
                            round_digits = 1)

# Plot
(k5$tree|w3$w)+patchwork::plot_layout(widths = c(1, 1))

Second, we can visualize the distributions of the numeric features in each bubble as violins, while the cell-specific values are shown as jittered points. We can do this with get_num_feature_violins. This function uses the same input as get_num_feature_tiles:

w4 <- get_num_feature_violins(d = k5,
                             as = as,
                             plot_title = "",
                             scales = 'free_x',
                             show_cells = F)

(k5$tree|w3$w|w4$w)+patchwork::plot_layout(widths = c(1.5, 1.1, 2.5))

Compositional visualization

scBubbletree uses the R-package patchwork to combine ggplot2 and ggtree plots. This makes it convenient for users to ‘attach’ other plots that are generated by any of these package.

Lets combine the UMAP plots shown earlier with the output of scBubbletree:

Summary

scBubbletree is intended to promote simple and transparent analysis of scRNA-seq. The user is encouraged to optimize k and at each stage to perform feature annotation to understand the biology behind the new bubbles. At the end, user of scBubbletree must be able answer the following questions:

  1. is the selected k biologically justifiable?
  2. is the structure of the generated bubbletree biologically meaningful? For instance:
    • are relative sizes of bubbles (cell-types) consistent with prior knowledge?
    • are distances between pairs of bubbles in the tree smaller between similar cell types and the corresponding branches robust?
    • does the bubble segmentation correspond with the feature annotation data (e.g. marker gene expressions)?

We will provide answer to these questions in Case study B.

scBubbletree can incorporate clustering results from approaches

Wide range of clustering approaches are used to cluster scRNA-seq data. To accommodate for this scBubbletree implements the function dummy_bubbletree which can generate a bubbletree using segmentation of alternative approaches.

Lets do this using k-medoids clustering implemented as part of the R-package cluster:

FALSE Bubbletree construction ...

Case study B: 161,000 PBMCs from 8 healthy donors 2,3

Data

In this case study we will analyze scRNA-seq dataset of approximately 161,000 PBMCs derived from 8 healthy donors, collected at three timepoints and sequenced by 10x Genomics protocol. This is a large and complex scRNA-seq dataset, perfect for demonstrating the advantageS of scBubbletree over qualitative analyses e.g. based on 2D UMAPs.

In addition to gene expression data, this dataset reports for each cell a cell type prediction at three levels of resolution (l1, l2 and l3). The cell types have been predicted based on marker genes, and we will use them as categorical cell annotations. As this dataset is part of a larger CITE-seq experiment, the cell type predictions are made based on scRNA-seq derived gene expressions and also on the cell-surface protein expression levels of up to 228 marker proteins.

In essence, we will perform the same steps as in Case study A:

  1. determine k \(\rightarrow\) with \(f:\) get_k
  2. perform clustering and generate dendrogram \(\rightarrow\) with \(f:\) get_bubbletree
  3. inspect the dendrogram structure and annotate it with categorical and quantitative cell meta data \(\rightarrow\) with \(fs:\) get_cat_feature_tiles,get_num_feature_tiles,get_num_feature_violins
  4. assess the model: criticize the results \(\rightarrow\) with \(f_s:\) update_bubbletree, get_gini, get_gini_boot
  5. if necessary modify k and go back to 2.

The raw gene expressions have already been preprocessed with Seurat. SCT transformation was used for normalization, followed by principal component analysis for dimensionality reduction. 50 lower-space features have been extracted and used to generate a 2D UMAP.

Lets load the data.

# To get the data used in this case study do the following steps:
# 1. download reference data from vignette:
# https://satijalab.org/seurat/articles/multimodal_reference_mapping.html
# 2. load SeuratDistk
# library(SeuratDisk, lib.loc = lib.loc)
# d <- LoadH5Seurat("~/BubbleMap/raw_data/Hao_2021/pbmc_multimodal.h5seurat")
# save(d, file = "raw_data/Hao_2021/Hao_2021.RData")
d <- get(load(file = "raw_data/Hao_2021/Hao_2021.RData"))

Lets look at the number of cells per donor at a given sampling time:

table(d@meta.data$time, d@meta.data$donor)
FALSE    
FALSE       P1   P2   P3   P4   P5   P6   P7   P8
FALSE   0 6443 5978 4698 5307 7020 6093 8909 8916
FALSE   3 5917 5714 5002 5793 6933 7583 8200 9203
FALSE   7 5775 5513 4960 5990 7858 7066 8762 8131

Also, we can show the number of cells per donor and predicted cell type at the lowest resolution (l1):

table(d@meta.data$celltype.l1, d@meta.data$donor)
FALSE          
FALSE             P1   P2   P3   P4   P5   P6   P7   P8
FALSE   B        715 1335  925 1296 1798 3229 3472 1030
FALSE   CD4 T   6500 5717 4721 5103 2925 4457 5429 6149
FALSE   CD8 T   3233 3497 1623 3049 3605 1503 3232 5727
FALSE   DC       334  276  250  333  668  379  904  445
FALSE   Mono    5036 3500 4088 5188 8022 6627 9196 7353
FALSE   NK      1590 1610 2423  688 2760 3357 2303 3933
FALSE   other    491  210  180  154  952  506  576  373
FALSE   other T  236 1060  450 1279 1081  684  759 1240

How many distinct cell types are there at each resolution?

length(unique(d@meta.data$celltype.l1))
FALSE [1] 8
length(unique(d@meta.data$celltype.l2))
FALSE [1] 31
length(unique(d@meta.data$celltype.l3))
FALSE [1] 58

Next we will show the 2D UMAP. Cells are color coded according to their cell type predictions resolution level l1 (left UMAP) and l2 (right UMAP):

# Lets show the generated 2D UMAP
UMAPPlot(object = d, 
         reduction = "umap", 
         group.by = 'celltype.l1', 
         pt.size = 0.25)|
  UMAPPlot(object = d, 
         reduction = "umap", 
         group.by = 'celltype.l2', 
         pt.size = 0.25)

Challenges with 2D UMAP based scRNA-seq data exploration

Now that the UMAP contains more than 161,000 cells from a heterogeneous tissue, we start to see the challenges associated with this approach:

  1. massive overplotting

    • limited visibility: the cells (points) form blobs of cells \(\rightarrow\) some points are covered by others
    • even at resolution level l1 we have difficulty to distinguish between cell types and their colors. This is quite more challenging for level l2 and l3 (not shown)
    • increasing the number of cells in the dataset, which is likely to happen as throughput of scRNA-seq techniques improves, will make the problem even worse
    • comparison of 2D UMAPs between studies difficult -> UMAP rotation, colors, cell order (which cell is shown on the top/bottom) affects interpretation
  2. lack of compositional information

    • it is nearly impossible to ascertain the absolute/relative cell frequencies in e.g. cell types, samples, timepoints
    • impossible to tell whether the samples are equally mixed across the cell types or if there are some systematic biases.
    • we could generate 8 x 3 separate UMAPs for each pair of donor and timepoint -> difficult to interpret due to 1. + only major deviations will be noticeable.
  3. qualitative analysis

    • distances between cell types should not be interpreted in Euclidean distance sense -> cell type structure difficult to discern from 2D UMAP

scBubbletree approach

Now lets turn to scBubbletree. As earlier, our first goal is to determine k for the clustering of matrix \(A^{n \times f}\), which represents the low- dimensional feature space of the normalized gene expression matrix. We will use the first \(f\)=15 PCA dimensions as features. We selected \(f\)=15 based on the elbow plot below.

ElbowPlot(object = d, ndims = 50, reduction = "pca")

# This is the main input of scBubbletree 
# -> matrix A with n=cells, f=features (from PCA)
A <- d@reductions$pca@cell.embeddings[, 1:15]
# meta data
meta <- d@meta.data
# quantitative features (gene expressions of marker genes) to be used later on:
# * GNLY, NKG7: NK cells
# * IL7R:   CD4 T cells
# * CD8A:   CD8 T cells
# * MS4A1:  B cells
# * CD14, LYZ:  CD14+ Monocytes
# * FCGR3A, MS4A7:  FCGR3A+ Monocytes
# * FCER1A, CST3:   Dendritic Cells
# * PPBP:   Megakaryocytes
as <- as.matrix(t(d@assays$SCT@data[
  rownames(d@assays$SCT@data) %in% 
    c("IL7R", 
      "CD14", "LYZ", 
      "MS4A1", 
      "CD8A", 
      "GNLY", "NKG7",
      "FCGR3A", "MS4A7",
      "FCER1A", "CST3",
      "PPBP"), ]))
FALSE              used   (Mb) gc trigger    (Mb)   max used   (Mb)
FALSE Ncells    8356105  446.3   12798815   683.6   12798815  683.6
FALSE Vcells 1191408814 9089.8 1777478561 13561.1 1243503274 9487.2

We will once again use the function get_k for clustering. Notice the modified parameter cv_index_p=0.2, for faster execution of this large dataset:

if(redo_case_b) {
  # Determine appropriate number of clusters (k)
  b <- get_k(B = 30,
             cv_clust_p = 1, # use 100% for clustering
             cv_index_p = 0.2, # use 20% for gap stat.
             ks = 1:40,
             x = A,
             n_start = 10,
             iter_max = 20,
             cores = 20)
  
  if(dir.exists("case_study_B")==F) {
    dir.create("case_study_B")
  }
  save(b, file = "case_study_B/b.RData")
} else {
  b <- get(load("~/scBubbletree/case_study_B/b.RData"))
}

We see a more complex silhouette curve. This is normal given the complexity in our dataset, which is composed of many cell types with high and low relative frequencies. We see a maximum silhouette index at k=5. The index then drops sharply until k=11, followed by shallow increase until k=14 after the silhouette starts once again to decrease.

Elbows are not easy to detect from the Gap and WSS curves. It is quite evident, however, that both curves flatten at around k=15 to k=20.

k=5

Lets start with k=5 and use get_bubbletree to perform k-means clustering followed by hierarchical clustering:

if(redo_case_b) {
  # Determine appropriate number of clusters (k)
  k5 <- get_bubbletree(x = A,
                            k = 5,
                            n_start = 100,
                            iter_max = 100,
                            seed = 4321,
                            cores = 5,
                            B = 50,
                            N_eff = 100,
                            verbose = FALSE)
  save(k5, file = "case_study_B/k5.RData")
} else {
  k5 <- get(load("~/scBubbletree/case_study_B/k5.RData"))
}

Next, we can plot the bubbletree:

  • bubbles 3 and 4 are dominant as they contain approximately 70% of the cells
  • now we also see some strcuture in the bubbletree, e.g. bubbles 1 and 3 form one branch with high support, which is then also joined with bubble 2.
  • bubbles 4 and 5 are clearly separated from the branch formed by bubbles 1, 2 and 3
k5$tree

Does this make biological sense? Lets visualize the cell type compositions (figure B, notice parameter feature_composition = F) of the different bubbles using the categorical feature l1. Furthermore, we can also show the composition of features across the different bubbles (figure C, notice parameter feature_composition = T).

Summary of the bubbletree and figure B:

  • the branch with bubbles 1 and 3 contains CD4+ and CD8+ T-cells, NK-cells and other T-cells
  • about 92% of the cells in bubble 2 are B-cells and 7% are DCs. B-cells and DCs are not related, which points at potential undersegmentation.
  • it is evident that the branch with bubble 1,3 and 2 contains different lymphocytes
  • nearly all cells in bubble 4 are monocytes and DCs
  • the smallest bubble 5 is nearly exclusively composed of ‘other’ cells and about 3% monocytes
  • all branches have high (complete) bootstrap support, i.e. in the 50 bootstrap trees we find identical branching as in the consensus bubbletree.
  • in summary, the clustering algorithm with \(k=5\) performs coarse but biologically meaningful cell type segmentation.
# cell type annotations (level 1)
a <- meta$celltype.l1 

# create tile plot showing cluster compositions
w1 <- get_cat_feature_tiles(d = k5,
                            a = a,
                            feature_composition = F,
                            round_digits = 1,
                            show_hclust = F)

# create tile plot showing feature compositions
w2 <- get_cat_feature_tiles(d = k5,
                            a = a,
                            feature_composition = T,
                            round_digits = 1,
                            show_hclust = F)
(k5$tree|w1$w|w2$w)+
  patchwork::plot_layout(widths = c(1, 1.5, 1.5))+
  patchwork::plot_annotation(tag_levels = 'A')

Earlier we stated that the data is derived from 8 donors at 3 timepoints. Are the cells in each sample well mixed in each of the different bubbles?

We see interesting patterns in the feature compositions between the samples as shown in figure D and the corresponding dendrogram with two branches. For instance bubble 3 cells (CD4+ and CD8+ T-cells) appear to be enriched in P4 samples. We do not see systematic biases between samples that correspond to different timepoints.

# lets first create a variable showing pairs (donor x timepoint) 
a <- paste0(meta$donor, '_', meta$time)

# categorical donor meta data
w3 <- get_cat_feature_tiles(d = k5,
                            a = a,
                            feature_composition = T,
                            round_digits = 0,
                            show_hclust = T)
((k5$tree|w1$w|w2$w)+patchwork::plot_layout(widths = c(1, 1.5, 1.5)))/(w3$w)+
  patchwork::plot_annotation(tag_levels = 'A')

Another categorical feature which we can visualized are the predicted phases of the different cells: G1 (cell growth), S (DNA replication), G2 (growth and preparation for mitosis), M (mitosis: ell division occurs). These prediction are included as part of the meta data.

In the next plot we show the composition of each bubble with respect to these three cell phases:

# cell type annotations (level 1)
a <- meta$Phase 

# create tile plot
w4 <- get_cat_feature_tiles(d = k5,
                            a = a,
                            feature_composition = T,
                            round_digits = 1)
((k5$tree|w1$w|w2$w)+patchwork::plot_layout(widths = c(1, 1.5, 1.5)))/
  ((w3$w|w4$w)+patchwork::plot_layout(widths = c(3, 1)))+
  patchwork::plot_annotation(tag_levels = 'A')

k=15

It is apparent from the results generated by setting k=5 that some bubbles are impure, i.e. they contain cells from different cell types, e.g. bubble 2 contains a mixture of two unrelated cell types, i.e. B-cells and DCs.

Furthermore, our goal is to inspect the data at a finer cellular resolution. For instance, lets see how well k=5 clustering segments the l2 cell type predictions?

As expected, many cluster contains multiple cell types \(\rightarrow\) if we want to identify cell clusters at l2 resolution it is evident that we need to increase k. Lets try k=15 (this is also a reasonable value for k based on silhouette, WSS and gap curves shown above).

# cell type annotations (level 1)
a <- meta$celltype.l2 

# create tile plot showing cluster compositions
w1 <- get_cat_feature_tiles(d = k5,
                            a = a,
                            feature_composition = T,
                            round_digits = 0,
                            show_hclust = F)

(k5$tree|w1$w)+
  patchwork::plot_layout(widths = c(1, 4))+
  patchwork::plot_annotation(tag_levels = 'A')

if(redo_case_b) {
  # do clustering with k=15
  k15 <- get_bubbletree(x = A,
                        k = 15,
                        n_start = 200,
                        iter_max = 1000,
                        seed = 4321,
                        cores = 5,
                        B = 50,
                        N_eff = 100,
                        verbose = FALSE,
                        round_digits = 1)
  save(k15, file = "case_study_B/k15.RData")
} else {
  k15 <- get(load("~/scBubbletree/case_study_B/k15.RData"))
}

Summary of the bubbletree:

  • the lymphocyte and monocyte branches of bubbles are well defined. Within branch structures are biologically meaningful
  • bubble 2 is the largest one in the tree and appears to be enriched with naive T-cells and Tregs, but also hematopoietic stem and progenitor cells (HSPC) \(\rightarrow\) additional clustering might be necessary
  • other T-cells, such as effector, central memory and cytotoxic T-cells, are enriched in the remaining bubbles 1, 8, 14 of the same branch.
  • NK-cells are contained in bubble 7, shown as an outgroup of lymphocytes
  • B-cells (naive, memory, intermediate and plasmablasts) are found in a common branch containing bubble 3 and 13
  • CD14+ and CD16+ monocytes and different types of DCs are contained in a separate branch of monocyte bubbles. Surprisingly, these bubbles are also enriched with doublets
  • most branches have high (complete) bootstrap support. This is not the case for the branch connecting bubble 8 and 14 (50% support).
  • cells from smaller PBMCs subtypes, such as platelets and erythrocytes, are found as separate bubbles shown as outgroups
  • in summary, the clustering algorithm with \(k=15\) recovers most l2 resolution cell types and produces a biologically meaningful bubbletree topology
# cell type annotations (level 2)
a <- meta$celltype.l2

# create tile plot
w1 <- get_cat_feature_tiles(d = k15,
                            a = a,
                            feature_composition = T,
                            round_digits = 0,
                            show_hclust = F)

(k15$tree|w1$w)+patchwork::plot_layout(ncol = 2, widths = c(3, 9))

Interestingly, the bubbles 1 and 2 contain many different cell types. Additional clustering of these bubbles might be necessary.

Bubble update

scBubbletree can perform targeted clustering of cells within specific bubbles. This can be achieved with the function update_bubbletree. Lets update bubbles 1 and 2 into two sub-bubbles.

Important remark: the function update_bubbletree should be used to explore the clustering in order to select an appropriate k and to explain the segmentation.

if(redo_case_b) {
  # do clustering with k=15
  u_k15 <- update_bubbletree(btd = k15,
                             updated_bubbles = c("2", "1"),
                             ks = c(2, 2),
                             cores = 20)
  save(u_k15, file = "case_study_B/u_k15.RData")
} else {
  u_k15 <- get(load("~/scBubbletree/case_study_B/u_k15.RData"))
}
# create tile plot
w2 <- get_cat_feature_tiles(d = u_k15,
                            a = a,
                            feature_composition = T,
                            round_digits = 0,
                            show_hclust = F)

(u_k15$tree|w2$w)+patchwork::plot_layout(ncol = 2, widths = c(4, 9))

Gini index

Can we show quantitatively that by increasing k we get “better” clustering in a semi-supervised way? Yes, we can use the Gini impurity index.

For this we will integrate the results obtained by function get_k with l1, l2 and l3 cell type predictions (labels), and show quantitatively the change in Gini index as a function of k. This is done with the function get_gini_boot. One potential caveat of this approach is that some of the predictions might be inaccurate.

Lets invoke the function get_gini_boot and pass the object b obtained earlier together with the cell type predictions:

b <- get(load("~/scBubbletree/case_study_B/b.RData"))
gini_l1 <- get_gini_boot(labels = meta$celltype.l1,
                         kmeans_boot_obj = b)
gini_l2 <- get_gini_boot(labels = meta$celltype.l2,
                         kmeans_boot_obj = b)
gini_l3 <- get_gini_boot(labels = meta$celltype.l3,
                         kmeans_boot_obj = b)

We next plot the Gini scores for the labels at resolution l1, l2 and l3.

Summary of the above plot:

  • more abstract cell types can be segmented more accurately and the resulting segmentation has lower total Gini impurity.
  • in each curve the total Gini impurity appears to decrease slowly for as k approaches 20.
  • this indicates that our previous data-based indices provide meaningful results
  • important: finding the right \(k\) also depends on the user’s objectives

Quantiative features

Gene expression annotations can also be integrated to better explain the bubbles and the tree structure. Lets visualize the mean gene expression of some marker genes in the bubbles:

  • GNLY, NKG7: NK cells
  • IL7R: CD4 T cells
  • CD8A: CD8 T cells
  • MS4A1: B cells
  • CD14, LYZ: CD14+ Monocytes
  • FCGR3A, MS4A7: FCGR3A+ Monocytes
  • FCER1A, CST3: Dendritic Cells
  • PPBP: Megakaryocytes
# This function build the nummeric annotations plot
w3 <- get_num_feature_tiles(d = k15,
                            as = as,
                            plot_title = "",
                            round_digits = 1)

# Plot
(k15$tree|w3$w)+patchwork::plot_layout(widths = c(1, 1))

Second, we can visualize the distribution of each marker gene in each bubble using violin plots with get_num_feature_violins. This function uses the same input as get_num_feature_tiles.

Lets invoke this function now.

w4 <- get_num_feature_violins(d = k15,
                              as = as,
                              plot_title = "",
                              scales = 'free_x')


((k15$tree|w3$w)+patchwork::plot_layout(widths = c(1.5, 3)))/w4$w

References


  1. Tian, Luyi, et al. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nature methods 16.6 (2019): 479-487.↩︎

  2. Hao, Yuhan, et al. “Integrated analysis of multimodal single-cell data.” Cell 184.13 (2021): 3573-3587.↩︎

  3. https://satijalab.org/seurat/articles/multimodal_reference_mapping.html↩︎